Mining Massive Amounts of Genomic Data: A Semiparametric Topic Modeling Approach
Abstract
Characterizing the functional relevance of transcription factors (TFs) in different biological contexts is pivotal in systems biology. Given the massive amount of genomic data, computational identification of TFs is often necessary to generate new hypotheses for experimentalists. In this paper, we use large gene expression and chromatin immunoprecipitation (ChIP) data corpora to conduct high-throughput TF-biological context association analysis. This work makes two contributions: (i) From a methodological perspective, we propose a unified topic modeling framework for exploring and analyzing large and complex genomic datasets. Under this framework, we develop new statistical optimization algorithms and semiparametric theoretical analysis, which are also applicable to a variety of large-scale data analyses. (ii) From a scientific perspective, our method provides an informative list of new discoveries in biology. Our data-driven analysis of 38 TFs in 68 tumor biological contexts identifies functional signatures of epigenetic regulators, such as SUZ12 and SET-DB1, and nuclear receptors, in many tumor types. In particular, the TF signature of SUZ12 is present in a broad range of tumor types, suggesting the important role of SUZ12-mediated histone methylation in tumor biology.
Similar resources
Robust high-dimensional semiparametric regression using optimized differencing method applied to the vitamin B2 production data
Background and purpose: As science, knowledge, and technology evolve, we deal with high-dimensional data in which the number of predictors may considerably exceed the sample size. The main problems with high-dimensional data are the estimation of the coefficients and interpretation. For high-dimension problems, classical methods are not reliable because of a large number of predictor variable...
Bayesian Data Fusion: a Reliable Approach for Descriptive Modeling of Ore Deposits
Recognition of ore deposit genesis is still a controversial challenge for economic geologists. Here, this task was addressed by virtue of Bayesian data fusion (BDF) implementing available proofs: semi-schematic examples with two (Cu and Pb + Zn) and three (Cu, Pb + Zn and Ag) evidences. The data, which in the current paper are just concentrations of the indicated elements, were collected from Angouran’s ...
Interactive Clustering for Exploration of Genomic Data
The complete genomic sequences for many organisms, particularly primitive organisms with relatively small genomes (prokaryotes), are now available. We describe an approach that supports interactive exploration of patterns in genomic data by combining use of positional weight matrices, the k-means clustering algorithm, and a visualization tool. Users interact with the system by examining a visua...
A Joint Semantic Vector Representation Model for Text Clustering and Classification
Text clustering and classification are two main tasks of text mining. Feature selection plays a key role in the quality of the clustering and classification results. Although word-based features such as term frequency-inverse document frequency (TF-IDF) vectors have been widely used in different applications, their shortcoming in capturing semantic concepts of text motivated researchers to use...
D-VITA: A Visual Interactive Text Analysis System Using Dynamic Topic Mining
Recent developments in web technologies like Web 2.0 have led to the generation of massive amounts of data. The rapid growth of data makes knowledge extraction and trend prediction a challenging task. A recent approach for the unsupervised analysis of text corpora is dynamic topic mining. While there is a growing interest in using this technique, interactive analysis systems for dynamic topic m...
Publication date: 2015